In this document we’re interested in generating predictions for every pairing, using Elo scores alone. We don’t expect this to do terribly well, but it’s a baseline.
Geddit?
Let’s read in the Elo ratings we downloaded from the Web.
elo_m <- read.csv('data/elo_ratings/atp22.csv')
head(elo_m)
| Rank | Player | Age | Elo | HardRaw | ClayRaw | GrassRaw | hElo | cElo | gElo | Peak.Match | Peak.Age | Peak.Elo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Carlos Alcaraz | 19.0 | 2200.2 | 2028.5 | 2038.1 | 1441.4 | 2114.4 | 2119.2 | 1820.8 | 2022 Roland Garros R16 | 19.0 | 2221.9 |
| 2 | Novak Djokovic | 35.0 | 2168.6 | 2054.9 | 2022.3 | 1942.4 | 2111.7 | 2095.4 | 2055.5 | 2016 Miami F | 28.8 | 2470.0 |
| 3 | Alexander Zverev | 25.1 | 2131.3 | 2012.7 | 2043.9 | 1671.4 | 2072.0 | 2087.6 | 1901.4 | 2022 Atp Cup RR | 24.7 | 2158.4 |
| 4 | Rafael Nadal | 36.0 | 2095.8 | 1934.9 | 1967.6 | 1500.0 | 2015.3 | 2031.7 | 1797.9 | 2009 Madrid SF | 22.9 | 2370.0 |
| 5 | Jannik Sinner | 20.8 | 2070.0 | 1969.9 | 1896.4 | 1312.8 | 2020.0 | 1983.2 | 1691.4 | 2022 Madrid R32 | 20.7 | 2079.0 |
| 6 | Stefanos Tsitsipas | 23.8 | 2058.2 | 1899.1 | 2040.2 | 1572.4 | 1978.6 | 2049.2 | 1815.3 | 2021 Roland Garros SF | 22.8 | 2133.0 |
According to the website, the gElo column is a 50/50 average of the all-surfaces Elo score (Elo) and the grass-only Elo score (GrassRaw).
That’s their recommendation, though we may or may not follow it.
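As a quick sanity check on that description, the sketch below recomputes the blend for the first two rows of the table. The helper name `blend_elo` is hypothetical, and the numbers are copied by hand from the `head(elo_m)` output rather than read from the file.

```r
# Hypothetical helper: 50/50 blend of overall and surface-specific Elo
blend_elo <- function(overall, surface_raw) {
  (overall + surface_raw) / 2
}

# Values copied from the head(elo_m) output (Alcaraz, Djokovic)
overall   <- c(2200.2, 2168.6)
grass_raw <- c(1441.4, 1942.4)

blend_elo(overall, grass_raw)
```

The results agree with the gElo values of 1820.8 and 2055.5 shown in the table.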
We can convert Elo scores to Bradley–Terry abilities. Let \(a_i\) represent the Elo score of player \(i\). Then the probability that player \(i\) defeats player \(j\) (ignoring the possibility of a draw) is given by
\[ p_{ij} = 1 - \frac{1}{1 + 10^{(a_i - a_j)/400}} = \frac{10^{a_i/400}}{10^{a_i/400} + 10^{a_j/400}}, \]
so that
\[ \frac{p_{ij}}{p_{ji}} = \frac{10^{a_i/400}}{10^{a_j/400}}, \]
hence
\[ \ln \frac{p_{ij}}{p_{ji}} = a_i \frac{\ln 10}{400} - a_j \frac{\ln 10}{400}. \]
Thus one can convert an Elo rating into a Bradley–Terry log-ability by multiplying it by \(\frac{\ln 10}{400}\).
# Probability that a player rated a1 beats a player rated a2
elo_prob <- function(a1, a2) {
  1 / (1 + 10^((a2 - a1) / 400))
}
So the probability that Carlos Alcaraz beats Novak Djokovic is
elo_prob(elo_m[1, 'Elo'], elo_m[2, 'Elo'])
## [1] 0.5453511
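To illustrate the conversion derived above, here is a minimal sketch that rescales two Elo ratings to the Bradley–Terry log scale and checks that the logistic function of their difference reproduces the Elo probability. The name `elo_to_bt` is hypothetical, `elo_prob` is redefined only so that the snippet runs on its own, and the two ratings are copied from the table.

```r
elo_prob <- function(a1, a2) {
  1 / (1 + 10^((a2 - a1) / 400))
}

# Hypothetical helper: Elo rating -> Bradley-Terry log-ability
elo_to_bt <- function(a) {
  a * log(10) / 400
}

alcaraz  <- 2200.2
djokovic <- 2168.6

# plogis(x) = 1 / (1 + exp(-x)) is the logistic function from base R's stats
p_bt  <- plogis(elo_to_bt(alcaraz) - elo_to_bt(djokovic))
p_elo <- elo_prob(alcaraz, djokovic)

all.equal(p_bt, p_elo)
```

Because the two parameterisations differ only by a constant rescaling, fitting or predicting on either scale should give the same win probabilities.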
Now to make some predictions. Let’s pull in the template with all the pairings. Watch out for non-ASCII letters! (Make sure to set the encoding to UTF-8.)
template <- read.csv('submission-template.csv', encoding = 'UTF-8')
head(template, 10)
| player1_name | player2_name | player1_id | player2_id | Gender | p_player1_win | p_player2_win |
|---|---|---|---|---|---|---|
| iga swiatek | barbora krejcikova | 1 | 2 | W | NA | NA |
| iga swiatek | paula badosa | 1 | 3 | W | NA | NA |
| iga swiatek | maria sakkari | 1 | 4 | W | NA | NA |
| iga swiatek | anett kontaveit | 1 | 5 | W | NA | NA |
| iga swiatek | karolina pliskova | 1 | 6 | W | NA | NA |
| iga swiatek | ons jabeur | 1 | 7 | W | NA | NA |
| iga swiatek | aryna sabalenka | 1 | 8 | W | NA | NA |
| iga swiatek | danielle collins | 1 | 9 | W | NA | NA |
| iga swiatek | garbiñe muguruza | 1 | 10 | W | NA | NA |
| iga swiatek | jessica pegula | 1 | 11 | W | NA | NA |
Are all the male players in our table?
library(dplyr)
men <- with(subset(template, Gender == 'M'),
            union(player1_name, player2_name))
Watch out for encodings or invisible Unicode characters (like non-breaking spaces) in data that’s scraped from the web!
I have fixed this in webscraping.Rmd, so the following should now work.
'novak djokovic' %in% men
## [1] TRUE
tolower(elo_m$Player[2])
## [1] "novak djokovic"
'novak djokovic' == tolower(elo_m$Player[2])
## [1] TRUE
Are there any non-breaking spaces left? Escaping non-ASCII characters would reveal them:
stringi::stri_escape_unicode('novak djokovic')
## [1] "novak djokovic"
stringi::stri_escape_unicode(elo_m$Player[2])
## [1] "Novak Djokovic"
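For comparison, this is roughly what a stray non-breaking space would look like. The string `scraped` is made up for illustration (it is not taken from the data), with U+00A0 in place of the ordinary space.

```r
library(stringi)

# Made-up example: U+00A0 (non-breaking space) instead of a normal space
scraped <- "novak\u00a0djokovic"

scraped == "novak djokovic"    # FALSE, although both strings print alike
stri_escape_unicode(scraped)   # the escape sequence exposes the culprit

# One possible repair: replace U+00A0 with an ordinary space
cleaned <- gsub("\u00a0", " ", scraped, fixed = TRUE)
cleaned == "novak djokovic"    # TRUE
```

The escaped output above contains no escape sequences, which suggests the scraped names are now free of such characters.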
Now, who in the submission template is missing from the scraped Elo ratings?
men[!men %in% tolower(elo_m$Player)]
## [1] "felix auger-aliassime" "albert ramos-vinolas" "roger federer"
## [4] "soonwoo kwon" "jan-lennard struff"
Are they really missing?
library(stringr)
str_subset(elo_m$Player, 'elix') # hyphens
## [1] "Felix Auger Aliassime"
str_subset(elo_m$Player, 'amos') # double-barrelled name
## [1] "Albert Ramos"
str_subset(elo_m$Player, 'oger')
## character(0)
str_subset(elo_m$Player, 'won') # spacing
## [1] "Soon Woo Kwon"
str_subset(elo_m$Player, 'nnard') # hyphens
## [1] "Jan Lennard Struff"
And now the same for women:
elo_w <- read.csv('data/elo_ratings/wta22.csv')
women <- with(subset(template, Gender == 'W'),
              union(player1_name, player2_name))
women[!women %in% tolower(elo_w$Player)]
## [1] "garbiñe muguruza" "coco gauff" "alizé cornet"
## [4] "elena-gabriela ruse" "irina-camelia begu" "xinyu wang"
Diagnose the issues:
str_subset(elo_w$Player, 'arbi') # diacritics
## [1] "Garbine Muguruza"
str_subset(elo_w$Player, 'auf') # different forename?
## [1] "Cori Gauff"
str_subset(elo_w$Player, 'orne') # diacritics
## [1] "Alize Cornet"
str_subset(elo_w$Player, 'a ?[Gg]abr') # hyphenation
## [1] "Elena Gabriela Ruse"
str_subset(elo_w$Player, 'elia') # hyphenation
## [1] "Irina Camelia Begu"
str_subset(elo_w$Player, 'in ?[Yy]u') # spacing
## [1] "Xin Yu Wang"
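The diagnoses above suggest that most mismatches come down to diacritics, hyphens and spacing. One possible way to reconcile them, sketched here under that assumption, is to compare names on a normalised key: transliterate to ASCII, convert to lower case, and drop everything except letters. The helper `name_key` is hypothetical, and the names are copied from the outputs above. Cases such as "coco gauff" versus "Cori Gauff", or "albert ramos-vinolas" versus "Albert Ramos", would still need a manual alias, and no match for Roger Federer was found in the men's table at all.

```r
library(stringi)

# Hypothetical helper: a matching key that ignores case, accents,
# hyphens and spaces
name_key <- function(x) {
  ascii <- stri_trans_general(x, "Latin-ASCII")  # strips diacritics
  gsub("[^a-z]", "", tolower(ascii))
}

template_names <- c("garbi\u00f1e muguruza", "elena-gabriela ruse",
                    "xinyu wang", "coco gauff")
elo_names      <- c("Garbine Muguruza", "Elena Gabriela Ruse",
                    "Xin Yu Wang", "Cori Gauff")

# Position of each template name in the Elo table (NA if unmatched)
idx <- match(name_key(template_names), name_key(elo_names))
idx
```

Here the first three names find their counterparts and only "coco gauff" comes back as `NA`, which is the kind of case an alias table would have to cover.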